Recognizing arbitrary multi-character text in unconstrained natural photographs is a hard problem. In this paper, we address an equally hard sub-problem in this domain, viz. recognizing arbitrary multi-digit numbers from Street View imagery. Traditional approaches to this problem typically separate out the localization, segmentation, and recognition steps. In this paper we propose a unified approach that integrates these three steps via the use of a deep convolutional neural network that operates directly on the image pixels. We employ the DistBelief implementation of deep neural networks in order to train large, distributed neural networks on high quality images. We find that the performance of this approach increases with the depth of the convolutional network, with the best performance occurring in the deepest architecture we trained, with eleven hidden layers. We evaluate this approach on the publicly available SVHN dataset and achieve over $96\%$ accuracy in recognizing complete street numbers. We show that on a per-digit recognition task, we improve upon the state-of-the-art, achieving $97.84\%$ accuracy. We also evaluate this approach on an even more challenging dataset generated from Street View imagery containing several tens of millions of street number annotations and achieve over $90\%$ accuracy. To further explore the applicability of the proposed system to broader text recognition tasks, we apply it to synthetic distorted text from reCAPTCHA. reCAPTCHA is one of the most secure reverse Turing tests that uses distorted text to distinguish humans from bots. We report a $99.8\%$ accuracy on the hardest category of reCAPTCHA. Our evaluations on both tasks indicate that at specific operating thresholds, the performance of the proposed system is comparable to, and in some cases exceeds, that of human operators.